1 Introduction

Generalized linear models (GLM) as the name implies are a generalization of the linear modeling framework to allow for the modeling of response variables (e.g. soil attributes) with non-normal distributions and heterogeneous variances. Whereas linear models are designed for predicting continuous soil properties such as clay content or soil temperature, GLM can be used to predict the presence/absence of argillic horizons (i.e. logistic regression) or counts of a plant species along a transect (i.e. Poisson regression). These generalizations greatly expand the applicability of the linear modeling framework, while still allowing for a similar fitting procedure and interpretation of the resulting models.

In the past in order to handle non-linearity and heterogeneous variances, transformations have been made to the response variable, such as the log(x). However, such transformations complicate the models interpretation because the results refer to the transformed scale (e.g. log(x)). These response transformations are not guaranteed to achieve both normality and constant variance simultaneously. GLM approaches transform the response, but also preserve the scale of the response, and provide separate functions to transform the mean response and variance, known as the link and variance functions respectively. So instead of looking like this:

\(f(y) = \beta_{0} + \beta_{1}x + \varepsilon\)

you get this:

\(g(\mu)\) or \(\eta = \beta_{0} + \beta_{1}x + \varepsilon\)

with \(g(\mu)\) or \(\eta\) symbolizing the link function.

Another alteration of the classical linear model is that with GLM the coefficients are estimated iteratively by maximum likelihood estimation instead of ordinary least squares. This results in the GLM minimizing the deviance, instead of the sum of squares. However, for the Gaussian (i.e. normal) distributions the deviance and sum of squares are equivalent.

2 Logistic regression

Logistic regression is a specific type of GLM designed to model data that has a binomial distribution (i.e. presence/absence, yes/no, or proportional data), which in statistical learning parlance is considered a classification problem. For binomial data the logit link transform is generally used. The effect of the logit transform can be seen in the following figure. It creates a sigmoidal curve, which enhances the separation between the two groups. It also has the effect of ensuring that the values range between 0 and 1.

When comparing a simple linear model vs a simple logistic model we can see the effect of the logit transform on the relationship between the response and predictor variable. As before it follows a sigmoidal curve and prevents predictions from exceeding 0 and 1.

Examples of Logistic regression output showing probability of red clay parent material, mollisol and ponded components:

Example 1 Example 2 Example 3

3 Logistic regression rules of thumb

  • The response variable is discrete, i.e. binomial, present/absent, 0/1
  • The independent variables can be numeric or categorical
  • No assumptions for normality among independent variables
  • Check for highly correlated variables and select accordingly
  • The minimum number of cases per independent variable is 10:1
  • The preferred number of cases per independent variable is 20:1
  • The preferred number of cases per independent variable is 50:1 when using stepwise logistic regression

4 Logistic regression quick example

This example will provide a quick introduction to logistic regression by exploring the presence of soils with spodic characteristics in the Central Appalachians of West Virginia. Spodisols and soils with spodic properties form under the process of podzolization. The process of podzolization involves the removal (eluviation) of organic material, aluminum and iron from upper soil horizons (O, A and E) and the accumulation (illuviation) of these materials in the subsoil spodic horizon(s). In this region, these soils are associated with the past and present occurence of red spruce forest cover.

Load the required packages and set the working directory. Change the working directory to accomodate your working environment.

require(sp)
require(raster)
require(rgdal)
require(rms)

setwd("C:/workspace")

Select sample file, create data frame and view the first few records

githubURL <- "https://raw.githubusercontent.com/ncss-tech/stats_for_soil_survey/master/data/logistic/wv_transect_editedforR.csv"
pts <- read.csv(githubURL)
names(pts) 
##  [1] "x"            "y"            "Overtype"     "Underconifer"
##  [5] "Oi"           "Oe"           "Oa"           "Otot"        
##  [9] "epipedon"     "spodint"      "subgroup"     "order"       
## [13] "ps"           "drainage"     "series"       "taxon"       
## [17] "slope"        "surfacetex"   "stoniness"    "depthclass"  
## [21] "bedrockdepth" "depth_cm"     "hillslope"    "tipmound"    
## [25] "rainfall"     "geology"      "aachn"        "dem10m"      
## [29] "downslpgra"   "eastness"     "greenrefl"    "landsatb1"   
## [33] "landsatb2"    "landsatb3"    "landsatb7"    "maxc100"     
## [37] "maxent"       "minc100"      "mirref"       "ndvi"        
## [41] "northeastn"   "northness"    "northwestn"   "planc100"    
## [45] "proc100"      "protection"   "relpos11"     "slp50"       
## [49] "solar"        "tanc75"

Soil scientists developed a local, ad-hoc classification of the intensity of spodic expression for use in the region that is an ordered data type. This will need to be converted to a binary classification for modeling purposes.

Add a column called spod_pres_cons to the pts object that converts spodint to a binary variable

pts$spod_pres_cons <- ifelse(pts$spodint <= 1, 0, 1)

Create a model using dem10m, eastness, northness, and maxent to predict spod_pres_con and view the summary

GLM.1 <- lrm(spod_pres_cons ~ dem10m + eastness + northness + maxent, data=pts)
print(GLM.1)
## Logistic Regression Model
##  
##  lrm(formula = spod_pres_cons ~ dem10m + eastness + northness + 
##      maxent, data = pts)
##  
##                        Model Likelihood     Discrimination    Rank Discrim.    
##                           Ratio Test           Indexes           Indexes       
##  Obs           250    LR chi2      52.63    R2       0.268    C       0.773    
##   0            174    d.f.             4    g        1.339    Dxy     0.546    
##   1             76    Pr(> chi2) <0.0001    gr       3.817    gamma   0.547    
##  max |deriv| 1e-09                          gp       0.234    tau-a   0.232    
##                                             Brier    0.167                     
##  
##            Coef    S.E.   Wald Z Pr(>|Z|)
##  Intercept  7.4505 2.3284  3.20  0.0014  
##  dem10m    -0.0090 0.0023 -3.82  0.0001  
##  eastness  -0.8647 0.2563 -3.37  0.0007  
##  northness  0.6756 0.2310  2.93  0.0034  
##  maxent     0.0315 0.0086  3.67  0.0002  
## 

The summary will look similar to the following:

Summary output

Summary output

Evaluating the results involves review of several key values as noted in the summary figure:

  1. Is the ratio of observations to independent variables ok?
  2. What is the relationship between the dependent and independent variables? A low P value indicates there is a relationship between the DV and IV
  3. Is multicollinearity present? Standard Errors should be less than 2, which does not pertain to the intercept.
  4. Relationship of individual independent variables to the dependent variable. Small “p” values indicate the independent variable is a meaningful predictor
  5. Does the model perform better than random chance? “C” refers to the concordance aka c-index or AUC, with the following suggested scale (Hosmer & Lemeshow, 2013), 0.5 = no discrimination, 0.7 - 0.8 acceptable discrimination, 0.8 - 0.9 excellent discrimination, >0.9 outstanding discrimination
  6. What is the “Goodness of fit” for the model? The R2 of linear regression does not exist for Logistic regression. A measure called the pseudo R squared is only roughly analogous. There are several methods for calculating the pseudo R squared. In general, the higher the value the greater the variability that is explained by the independent variables.

Run a prediction of the model using the dem10m, eastness, northness, and maxent raster files. These files will be downloaded, a raster stack will be built and the GLM.1 model applied. When using your own data, the “stack” code only works if all rasters are co-registered, are .img files, have the same projection and spatial extent, and are stored in your working directory. In practice, other GDAL file formats should also work.

githubURL <- "https://raw.githubusercontent.com/ncss-tech/stats_for_soil_survey/master/data/logistic/wv_raster.RData"
load(url(githubURL))

# load("C:/workspace/wv_raster.RData")
# glm_r <- raster("C:/workspace/spodic_pres_GLM1.img")

glm_r <- predict(rasters, GLM.1, filename = "spodic_pres_GLM1.img", type = "fitted", progress = "text", overwrite = TRUE)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================                                                 |  25%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================                |  75%
  |                                                                       
  |=================================================================| 100%
## 
plot(glm_r)

5 Logistic regression detailed example

Now that we’ve discussed some of the basic background GLM theory we’ll move on to a real example, and address any additional theory where it relates to specific steps in the modeling process. The examples selected for this chapter come from Joshua Tree National Park (JTNP)(i.e. CA794) in the Mojave desert. The problem tackled here is a familiar one: Where can I expect to find argillic horizons on fan piedmonts? Argillic horizons within the Mojave are typically found on fan remnants, which are a stable landform that is a remnant of the Pleistocene (Peterson, 1981). Despite the low relief of most fans, fan remnants are uplands in the sense that they generally don’t receive run-on or active deposition.

With this dataset we’ll encounter some challenges. To start with, fan piedmont landscapes typically have relatively little relief. Since most of our predictors will be derivatives of elevation, that won’t leave us with much to work with. Also, our elevation data comes from the USGS National Elevation dataset (NED), which provides considerably less detail than say LiDAR or IFSAR data (Shi et al., 2012). Lastly our pedon dataset like most in NASIS, hasn’t received near as much quality control as have the components. So we’ll need to wrangle some of the pedon data before we can analyze it. These are all typical problems encountered in any data analysis and should be good practice. Ideally, it would be more interesting to try and model individual soil series with argillic horizons, but due to some of the challenges previously mentioned it would be difficult with this dataset. However, at the end we’ll look at one simple approach to try and separate individual soil series with argillic horizons.

5.1 Load packages

To start, as always we need to load some extra packages. This will become a familiar routine every time you start R. Most of the basic functions we need to develop a logistic regression model are contained in base R, but the following contain some useful spatial and data manipulation functions. Believe it or not we will use all of them and more.

library(aqp) # specialized soil classes and functions
library(soilDB) # NASIS and SDA import functions
library(raster) # guess
library(rgdal) # spatial import
library(ggplot2) # graphing
library(reshape2) # data manipulation
library(caret) # classification and regression training
library(car) # additional regression tools

5.2 Read in data

Hopefully like all good soil scientists and ecological site specialists you enter your field data into NASIS. Better yet hopefully someone else did it for you! Once data are captured in NASIS it is much easier to import the data into R, extract the pieces you need, manipulate it, model it, etc. If it’s not entered into NASIS, it may as well not exist.

# pedons <- fetchNASIS()
githubURL <- "https://raw.githubusercontent.com/ncss-tech/stats_for_soil_survey/master/data/ch7_data.Rdata"
load(url(githubURL))

str(pedons, max.level = 2) # Examine the makeup of the data we imported from NASIS.
## Formal class 'SoilProfileCollection' [package "aqp"] with 7 slots
##   ..@ idcol     : chr "peiid"
##   ..@ depthcols : chr [1:2] "hzdept" "hzdepb"
##   ..@ metadata  :'data.frame':   1 obs. of  1 variable:
##   ..@ horizons  :'data.frame':   4990 obs. of  43 variables:
##   ..@ site      :'data.frame':   1168 obs. of  79 variables:
##   ..@ sp        :Formal class 'SpatialPoints' [package "sp"] with 3 slots
##   ..@ diagnostic:'data.frame':   2133 obs. of  4 variables:

6 Exploratory analysis

6.1 Data wrangling

Generally before we begin modeling you should spend some time exploring the data. By examining a simple summary we can quickly see the breakdown of how many argillic horizons we have. Unfortunately, odds are good that all the argillic horizons haven’t been consistently populated in the diagnostic horizon table like they should be. Luckily for us, the desert argillic horizons always pop up in the taxonomic name, so we can use pattern matching to extract it. By doing this we gain an additional 11 pedons with argillic horizons and are able to label the missing values (i.e. NA). At a minimum for modeling purposes we probably need 10 pedons of the target we’re interested in and a total of 100 observations overall.

# Check consistency of argillic horizon population

s <- site(pedons) # get the site table

table(s$argillic.horizon, useNA = "ifany") # tabulate the number of argillic horizons observed
## 
## FALSE  TRUE  <NA> 
##   750   272   146
# or

# summary(s$argillic.horizon) 

# Extract argillic presence from the taxonomic subgroup

s$argillic <- grepl("arg", s$tax_subgroup)

table(s$argillic, useNA = "ifany")
## 
## FALSE  TRUE 
##   886   282

Ideally, if the diagnostic horizon table had been populated consistently we could have used the upper depth to diagnostic feature to filter out argillic horizons that start below 50cm, which may not be representative of “good” argillic horizons and may therefore have gotten correlated to a Torripsamments anyway. Not only are unrepresentative sites confusing for scientists, they’re equally confusing for models. However, as we saw earlier, some pedons don’t appear to be fully populated, so we’ll stick with those pedons that have the argillic specified in their taxonomic subgroup name, since it gives us the biggest sample.

d <- diagnostic_hz(pedons)
peiid <- d[d$diag_kind == "argillic horizon" & d$featdept < 50, "peiid"]
test <- s$peiid %in% unique(peiid)
summary(test)
##    Mode   FALSE    TRUE    NA's 
## logical     940     228       0

6.2 Geomorphic data

Another obvious place to look is at the geomorphic data in the site table. This information is intended to help differentiate where our soil observations exist on the landscape. If populated consistently it could potentially be used in future disaggregation efforts, as demonstrated by Nauman and Thompson (2014).

# Landform vs argillic presence

# Subset
s_sub <- subset(s, argillic == TRUE)

# Cross tabulate landform vs argillic horizon presence
test <- with(s_sub, 
             table(landform.string, argillic, useNA = "ifany")
             )
# Subset and print landform.string with > 3 observations
test[test > 3,]
##            alluvial fan               fan apron fan apron & fan remnant 
##                       9                      27                      25 
##             fan remnant                    hill               hillslope 
##                     109                      15                      32 
##                low hill                mountain          mountain slope 
##                       5                       6                       8 
##                pediment                    <NA> 
##                      11                       6
# generalize the landform.string
s$landform <- ifelse(grepl("fan|terrace|sheet|drainageway|wash", s$landform.string), "fan", "hill") 

Examining the above frequency table we can see that argillic horizons occur predominantly on fan remnants as was alluded too earlier. However, they also seem to occur frequently on other landforms - some of which are curious combinations of landforms or redundant terms.

# Hillslope position

# Subset fan landforms
s_sub <- subset(s, landform == "fan") 

# Cross tabulate and calculate proportions, the "2" calculates the proportions relative to the column totals
with(s_sub, round(
  prop.table(table(hillslope_pos, argillic, useNA = "ifany"), 2)
  * 100)
  ) 
##              argillic
## hillslope_pos FALSE TRUE
##     Toeslope     18    5
##     Footslope     2    2
##     Backslope    15   16
##     Shoulder      3    4
##     Summit       15   32
##     <NA>         47   40
# Slope shape

with(s_sub, round(
  prop.table(table(paste(shapedown, shapeacross), argillic, useNA = "ifany"), 2)
  * 100)
  )
##                  argillic
##                   FALSE TRUE
##   Concave Concave     1    1
##   Concave Convex      0    0
##   Concave Linear      4    1
##   Convex Concave      0    0
##   Convex Convex       8    8
##   Convex Linear       7    7
##   Linear Concave      7    3
##   Linear Convex      20   29
##   Linear Linear      41   45
##   Linear NA           0    0
##   NA NA              12    8

Looking at the hillslope position of fan landforms we can see a slightly higher proportion of argillic horizons are found on summits, while less are found on toeslopes. Slope shape doesn’t seem to provide any useful information for distinguishing argillic horizons.

# Surface morphometry, depth and surface rock fragments

# Recalculate gravel
s$surface_gravel <- with(s, 
                         surface_gravel - surface_fgravel
                         )
# Calculate the total surface rock fragments
s$frags <- apply(s[grepl("surface", names(s))], 1, sum) 

# Subset to just look and fans, and select numeric columns
s_sub <- subset(s, landform == "fan", select = c(argillic, bedrckdepth, slope_field, elev_field, frags)) 

s_m <- melt(s_sub, id = "argillic") # convert s_sub to wide data format
head(s_m, 2)
##   argillic    variable value
## 1    FALSE bedrckdepth    NA
## 2    FALSE bedrckdepth    11
ggplot(s_m, aes(x = argillic, y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scale = "free")

Looking at our numeric variables only depth to bedrock seems to show much separation between the presence/absence of argillic horizons.

6.3 Soil Scientist Bias

Next we’ll look at soil scientist bias. The question being: Are some soil scientists more likely to describe argillic horizons than others? Due to the excess number of soil scientist that have worked on CA794, including detailees, we’ve filtered the names of soil scientist to include just the top 3 mappers and given priority to the most senior soil scientists when they occur together.

# Custom function to filter out the top 3 soil scientists
desc_test <- function(old) {
  old <- as.character(old)
  new <- NA
  # ranked by seniority
  if (is.na(old)) {new <- "other"}
  if (grepl("Stephen", old)) {new <- "Stephen"} # least senior
  if (grepl("Paul", old)) {new <- "Paul"} 
  if (grepl("Peter", old)) {new <- "Peter"} # most senior
  if (is.na(new)) {new <- "other"}
 return(new)
}

s$describer2 <- sapply(s$describer, desc_test)

s_sub <- subset(s, landform == "fan")

# By frequency
with(s_sub, table(describer2, argillic, useNA = "ifany"))
##           argillic
## describer2 FALSE TRUE
##    other     134   58
##    Paul       64   28
##    Peter     208   86
##    Stephen    58   19
# By proportion
with(s_sub, round(
  prop.table(table(describer2, argillic), margin = 1)
  * 100)
  )
##           argillic
## describer2 FALSE TRUE
##    other      70   30
##    Paul       70   30
##    Peter      71   29
##    Stephen    75   25

For fan landforms, none of the soil scientists seem more likely than the others to describe argillic horizons. However while this information is suggestive, it is far from definitive in showing a potential bias because it doesn’t take into account other factors. We’ll examine this more closely later.

6.4 Plot coordinates

Where do our points plot? We can plot the general location in R, but for this task we will export them to a Shapefile, so we can view them in a proper GIS, and really inspect them. Notice in the figure below the number of points that fall outside the survey boundary. What it doesn’t show is the points that may plot in the Ocean or Mexico!

# Convert soil profile collection to a spatial object
pedons2 <- pedons
slot(pedons2, "site") <- s # this is dangerous, but something needs to be fixed in the site() setter function
idx <- complete.cases(site(pedons2)[c("x", "y")]) # create an index to filter out pedons with missing coordinates
pedons2 <- pedons2[idx]
coordinates(pedons2) <- ~ x + y # set the coordinates
proj4string(pedons2) <- CRS("+init=epsg:4326") # set the projection
pedons_sp <- as(pedons2, "SpatialPointsDataFrame") # coerce to spatial object
pedons_sp <- spTransform(pedons_sp, CRS("+init=epsg:5070")) # reproject

# Read in soil survey area boundaries
# ssa <- readOGR(dsn = "F:/geodata/soils/soilsa_a_nrcs.shp", layer = "soilsa_a_nrcs")
# ca794 <- subset(ssa, areasymbol == "CA794") # subset out Joshua Tree National Park
# ca794 <- spTransform(ca794, CRS("+init=epsg:5070"))

# Plot
plot(ca794, axes = TRUE)
plot(pedons_sp, add = TRUE) # notice the points outside the boundary

# Write shapefile of pedons
writeOGR(pedons_sp, dsn = "C:/workspace", "pedons_sp", driver = "ESRI Shapefile") 

6.4.1 Exercise 1: View the data in ArcGIS

  • Examine the shapefile in ArcGIS along with our potential predictive variables (hint classify the Shapefile symbology using the argillic horizon column)
  • Discuss with your group, and report your observations or hypotheses

6.5 Extracting spatial data

Prior to any spatial analysis or modeling, you will need to develop a suite of geodata files that can be intersected with your field data locations. This is, in and of itself a difficult task, and should be facilitated by your Regional GIS Specialist. Typically, these geodata files would primarily consist of derivatives from a DEM or satellite imagery. Prior to any prediction it is also necessary to ensure the geodata files have the same projection, extent, and cell size. Once we have the necessary files we can construct a list in R of the file names and paths, read the geodata into R, and then extract the geodata values where they intersect with field data.

folder <- "F:/geodata/project_data/8VIC/ca794/"
files <- c(
  elev   = "ned30m_8VIC.tif", # elevation
  slope  = "ned30m_8VIC_slope5.tif", # slope gradient
  aspect = "ned30m_8VIC_aspect5.tif", # slope aspect
  twi    = "ned30m_8VIC_wetness.tif", # topographic wetness index
  twi_sc = "ned30m_8VIC_wetness_sc.tif", # transformed twi
  ch     = "ned30m_8VIC_cheight.tif", # catchment height
  z2str  = "ned30m_8VIC_z2stream.tif", # height above streams
  mrrtf  = "ned30m_8VIC_mrrtf.tif", # multiresolution ridgetop flatness index
  mrvbf  = "ned30m_8VIC_mrvbf.tif", # multiresolution valley bottom flatness index
  solar  = "ned30m_8VIC_solar.tif", # solar radiation
  precip = "prism30m_8VIC_ppt_1981_2010_annual_mm.tif", # annual precipitation
  precipsum = "prism30m_8VIC_ppt_1981_2010_summer_mm.tif", # summer precipitation
  temp   = "prism30m_8VIC_tmean_1981_2010_annual_C.tif", # annual temperature
  ls     = "landsat30m_8VIC_b123457.tif", # landsat bands
  pc     = "landsat30m_8VIC_pc123456.tif", # principal components of landsat
  tc     = "landsat30m_8VIC_tc123.tif", # tasseled cap components of landsat
  k      = "gamma30m_8VIC_namrad_k.tif", # gamma radiometrics signatures
  th     = "gamma30m_8VIC_namrad_th.tif",
  u      = "gamma30m_8VIC_namrad_u.tif",
  cluster = "cluster152.tif" # unsupervised classification
  )

# combine the folder directory and file names
geodata_f <- paste0(folder, files) 

# Create a raster stack
geodata_r <- stack(geodata_f)

# Extract the geodata and imbed in a data frame
data <- data.frame(
   as.data.frame(pedons_sp)[c("pedon_id", "taxonname", "frags", "x_std", "y_std", "describer2", "landform.string", "landform", "tax_subgroup")],
   extract(geodata_r, pedons_sp)
   )

# Modify some of the geodata variables
data$mast <- data$temp - 4
idx <- aggregate(mast ~ cluster, data = data, function(x) round(mean(x, na.rm = TRUE), 2))
names(idx)[2] <- "cluster_mast"
data <- merge(data, idx, by = "cluster", all.x = TRUE)

data$cluster <- factor(data$cluster, levels = 1:15)
data$cluster2 <- reorder(data$cluster, data$cluster_mast)
data$gsi <- with(data, (ls_3 - ls_1) / (ls_3 + ls_2 + ls_1))
data$ndvi <- with(data, (ls_4 - ls_3) / (ls_4 + ls_3))
data$sw <- cos(data$aspect - 255)
data$twi_sc <- abs(data$twi - 13.8) # 13.8 = twi median

# save(data, ca794, pedons, file = "C:/workspace/ch7_data.Rdata")

# Strip out location and personal information before uploading to the internet
# s[c("describer", "describer2", "x", "y", "x_std", "y_std", "utmnorthing", "utmeasting", "classifier")] <- NA
# slot(pedons, "site") <- s
# data[c("describer2", "x_std", "y_std")] <- NA
# save(data, ca794, pedons, file = "C:/workspace/stats_for_soil_survey/trunk/data/ch7_data.Rdata")

6.6 Examine spatial data

With our spatial data in hand, we can now see whether any of the variables will help us separate the presence/absence of argillic horizons. Because we’re dealing with a classification problem, we’ll compare the numeric variables using boxplots. What we’re looking for are variables with the least amount of overlap in their distribution (i.e. the greatest separation in their median values).

# Load data
load(file = "C:/workspace/ch7_data.Rdata")
train <- data

# Select argillic horizons with "arg" in the subgroup name and on fans
# Argillic horizons that occur on hills and mountains more than likely form by different process, and therefore would require a different model.train$argillic 
train$argillic <- ifelse(grepl("arg", train$tax_subgroup) & 
                           train$mrvbf > 0.15,
                         TRUE, FALSE
                         )
train <- subset(train, !is.na(argillic), select = - c(pedon_id, taxonname, x_std, y_std, landform.string, cluster, cluster_mast, argillic.horizon, tax_subgroup, frags)) 

train2 <- subset(train, select = - c(describer2, landform, cluster2))
data_m <- melt(train2, id = "argillic")

ggplot(data_m, aes(x = argillic, y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free")

7 Modeling

7.1 Model training

Modeling is an iterative process that cycles between fitting and evaluating alternative models. Compared to tree and forest models, linear and generalized models require more input from the user. Automated model selection procedures are available, but are discouraged because they generally result in complex and unstable models. This is in part due to correlation amongst the predictive variables that can confuse the model. In addition, the order is which the variables are included or excluded from the model effects the significance of the others, and thus several weak predictors might mask the effect of one strong predictor. Therefore, it is best to begin with a selection of predictors that are known to be useful, and grow the model incrementally.

The example below is known as a forward selection procedure, where a full model is fit and compared against a null model, to assess the effect of the different predictors. For testing alternative models the Akaike’s Information Criterion (AIC) is used. When using the AIC to assess predictor significance, a smaller number is better.

full <- glm(argillic ~ ., data = train, family = binomial()) # "~ ." includes all columns in the data set
null <- glm(argillic ~ 1, data = train, family = binomial()) # "~ 1" just includes an intercept

add1(null, full, test = "Chisq") # using the AIC test the effect of adding additional predictors, generally select the predictor with the smallest AIC unless it goes against your intuition
## Single term additions
## 
## Model:
## argillic ~ 1
##            Df Deviance    AIC     LRT  Pr(>Chi)    
## <none>          872.18 891.63                      
## describer2  3   853.25 878.70  18.932 0.0002824 ***
## landform    1   768.81 790.26 103.377 < 2.2e-16 ***
## elev        1   869.79 891.24   2.398 0.1215117    
## slope       1   704.62 726.07 167.568 < 2.2e-16 ***
## aspect      1   869.85 891.30   2.328 0.1270350    
## twi         1   837.20 858.65  34.983 3.327e-09 ***
## twi_sc      1   690.72 712.17 181.463 < 2.2e-16 ***
## ch          1   870.81 892.26   1.370 0.2418957    
## z2str       1   805.75 827.20  66.429 3.628e-16 ***
## mrrtf       1   823.82 845.27  48.365 3.538e-12 ***
## mrvbf       1   819.41 840.86  52.775 3.740e-13 ***
## solar       1   861.83 883.28  10.353 0.0012926 ** 
## precip      1   864.20 885.65   7.986 0.0047148 ** 
## precipsum   1   854.28 875.73  17.901 2.327e-05 ***
## temp        1   870.08 891.53   2.102 0.1470626    
## ls_1        1   871.93 893.38   0.251 0.6163244    
## ls_2        1   869.79 891.24   2.390 0.1220857    
## ls_3        1   864.88 886.33   7.300 0.0068948 ** 
## ls_4        1   860.60 882.05  11.579 0.0006670 ***
## ls_5        1   846.15 867.60  26.034 3.355e-07 ***
## ls_6        1   847.19 868.64  24.997 5.741e-07 ***
## pc_1        1   855.78 877.23  16.399 5.132e-05 ***
## pc_2        1   836.95 858.40  35.233 2.926e-09 ***
## pc_3        1   858.12 879.57  14.061 0.0001769 ***
## pc_4        1   869.35 890.80   2.828 0.0926066 .  
## pc_5        1   872.17 893.62   0.015 0.9017456    
## pc_6        1   863.85 885.30   8.338 0.0038833 ** 
## tc_1        1   861.58 883.03  10.606 0.0011273 ** 
## tc_2        1   869.32 890.77   2.863 0.0906327 .  
## tc_3        1   836.92 858.37  35.260 2.886e-09 ***
## k           1   870.13 891.58   2.051 0.1521487    
## th          1   871.26 892.71   0.920 0.3374110    
## u           1   872.02 893.47   0.162 0.6877600    
## mast        1   870.08 891.53   2.102 0.1470626    
## cluster2   12   760.01 803.46 112.172 < 2.2e-16 ***
## gsi         1   835.26 856.71  36.921 1.230e-09 ***
## ndvi        1   868.77 890.22   3.409 0.0648437 .  
## sw          1   872.16 893.61   0.023 0.8802474    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see as the boxplots showed earlier that twi_sc has the smallest AIC and reduces the deviances the most. So let’s add twi_sc to the null model using the update() function. Then continue using the add1() or drop1() functions, until the model is saturated.

argi_glm <- update(null, . ~ . + twi_sc) # add twi_sc to the model, "-" will subtract predictors

# or refit

# argi_glm <- glm(argillic ~ twi_sc, data = train, family = binomial(link = "cloglog"))

# add1(argi_glm, full, test = "Chisq") # iterate until the model is saturated

# drop1(argi_glm, test = "Chisq") # test effect of dropping a predictor

argi_glm <- glm(argillic ~ twi_sc + slope + ls_1 + ch + z2str + mrvbf, data = train, family = binomial())

summary(argi_glm) # examine the effect and error for each predictors
## 
## Call:
## glm(formula = argillic ~ twi_sc + slope + ls_1 + ch + z2str + 
##     mrvbf, family = binomial(), data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.52678  -0.56621  -0.11136  -0.00255   2.87734  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.951983   0.810486   4.876 1.08e-06 ***
## twi_sc      -0.630857   0.097191  -6.491 8.53e-11 ***
## slope       -0.248590   0.053932  -4.609 4.04e-06 ***
## ls_1        -0.024334   0.009363  -2.599  0.00935 ** 
## ch          -0.002913   0.001175  -2.479  0.01318 *  
## z2str       -0.020934   0.006033  -3.470  0.00052 ***
## mrvbf       -0.285661   0.133716  -2.136  0.03265 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 875.83  on 989  degrees of freedom
## Residual deviance: 600.54  on 983  degrees of freedom
##   (40 observations deleted due to missingness)
## AIC: 614.54
## 
## Number of Fisher Scoring iterations: 8

After the model is saturated you should end up with a model similar to the one above.

7.2 Model evaluation

After we’re satisfied no additional variables will improve the fit, we need to evaluate it’s residuals, collinearity, accuracy, and model coefficients.

# Standard diagnostic plots for glm() objects
# plot(argi_glm) # plot regression diagnostics

# Term and partial residual plots
# termplot(argi_glm, partial.resid = TRUE)

The variance inflation factor (VIF) is used to assess collinearity amongst the predictors. Its square root indicates the amount of increase in the predictor coefficients standard error. A value greater than 2 indicates a doubling the standard error. Rules of thumb vary, but a square root of vif greater than 2 or 3 indicates an unacceptable value.

# Variance inflation, greater than 5 or 10 is bad
vif(argi_glm)
##   twi_sc    slope     ls_1       ch    z2str    mrvbf 
## 1.100552 1.684762 1.170740 1.163639 1.373601 1.879715

Because we’re dealing with a classification problem, we have to consider both errors of commission (Type I) and omission (Type II), or their corresponding accuracies of sensitivity (producer’s accuracy) and positive predicted value (user’s accuracy or precision) respectively. Before we can assess the error, however, we need to select a probability threshold.

  • Sensitivity and specificity examine how well the ground truth or reference data compares to the predictions.
  • Positive and negative predicted values (user’s accuracy) examine the inverse concept of how well the predictions match the reference data
comp <- data.frame(train[c("argillic", "cluster2")], 
                   pred = predict(argi_glm, train, type = "response") > 0.5
                   )
confusionMatrix(comp$pred, comp$argillic, positive = "TRUE")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE   803  113
##      TRUE     27   47
##                                           
##                Accuracy : 0.8586          
##                  95% CI : (0.8353, 0.8797)
##     No Information Rate : 0.8384          
##     P-Value [Acc > NIR] : 0.0443          
##                                           
##                   Kappa : 0.3336          
##  Mcnemar's Test P-Value : 6.779e-13       
##                                           
##             Sensitivity : 0.29375         
##             Specificity : 0.96747         
##          Pos Pred Value : 0.63514         
##          Neg Pred Value : 0.87664         
##              Prevalence : 0.16162         
##          Detection Rate : 0.04747         
##    Detection Prevalence : 0.07475         
##       Balanced Accuracy : 0.63061         
##                                           
##        'Positive' Class : TRUE            
## 
# Deviance squared
D2 <- with(argi_glm, 
           round((null.deviance - deviance) / null.deviance, 2)
           )

# Adjusted deviance squared
adjD2 <- with(argi_glm, 
              round(1 - ((df.null / df.residual) * (1 - D2)), 2)
              )
adjD2
## [1] 0.31
  • Discuss the variability of the predictions across the clusters, perhaps different models need to be constructed in each cluster, some clusters appear to be dominated by specific soil series, these data aren’t clean enough (nor are the series concepts usually) to model series separately, however, we could use the clusters as an additional model to attempt to separate the series. Do the hyperthermic clusters perform differently.
comp_sub <- subset(comp, argillic == TRUE)
temp <- by(comp_sub, list(comp_sub$cluster), function(x) with(x, data.frame(
  cluster = unique(cluster2),
  sum_arg = sum(argillic, na.rm = T),
  sum_pred = sum(pred, na.rm = T),
  sensitivity = round(sum(pred == argillic) / length(argillic), 2)
  )))
temp <- do.call(rbind, temp)
temp
##    cluster sum_arg sum_pred sensitivity
## 5        5      32       14        0.44
## 15      15      27        6        0.22
## 14      14      20       11        0.55
## 12      12       1        0        0.00
## 2        2      32        3        0.09
## 13      13       8        1        0.12
## 3        3      13        6        0.46
## 6        6       5        1        0.20
## 7        7      17        2        0.12
## 8        8       5        3        0.60
ggplot(temp, aes(x = cluster, y = sensitivity)) +
  geom_point()

# Remove hyperthermic points         
train_sub <- subset(train, temp < 22 + 4)

# full <- glm(argillic ~ ., data = train_sub, family = binomial(link = "cloglog"))
# null <- glm(argillic ~ 1, data = train_sub, family = binomial(link = "cloglog"))
# add1(null, full, train = "Chisq")

sub_glm <- glm(argillic ~ slope + twi_sc + ls_1 + mrvbf + z2str + ch, data = train_sub, family = binomial())

# summary(sub_glm)

comp <- data.frame(train_sub[c("argillic", "cluster2")], 
                   pred = predict(sub_glm, train_sub, type = "response") > 0.4
                   )
confusionMatrix(comp$pred, comp$argillic, positive = "TRUE")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE   375   40
##      TRUE     53   58
##                                           
##                Accuracy : 0.8232          
##                  95% CI : (0.7879, 0.8549)
##     No Information Rate : 0.8137          
##     P-Value [Acc > NIR] : 0.3103          
##                                           
##                   Kappa : 0.4452          
##  Mcnemar's Test P-Value : 0.2134          
##                                           
##             Sensitivity : 0.5918          
##             Specificity : 0.8762          
##          Pos Pred Value : 0.5225          
##          Neg Pred Value : 0.9036          
##              Prevalence : 0.1863          
##          Detection Rate : 0.1103          
##    Detection Prevalence : 0.2110          
##       Balanced Accuracy : 0.7340          
##                                           
##        'Positive' Class : TRUE            
## 
comp_sub <- subset(comp, argillic == TRUE)

temp <- by(comp_sub, list(comp_sub$cluster2), function(x) with(x, data.frame(
  cluster = unique(cluster2),
  sum_arg = sum(argillic, na.rm = T),
  sum_pred = sum(pred, na.rm = T), 
  sensitivity = round(sum(pred == argillic) / length(argillic), 2)
  )))
temp <- do.call(rbind, temp)
temp
##    cluster sum_arg sum_pred sensitivity
## 5        5      32       18        0.56
## 15      15      27       17        0.63
## 14      14      20       15        0.75
## 12      12       1        0        0.00
## 2        2      17        8        0.47
## 13      13       1        0        0.00
ggplot(temp, aes(x = cluster, y = sensitivity)) +
  geom_point()

# Examine the coefficients
summary(argi_glm)
## 
## Call:
## glm(formula = argillic ~ twi_sc + slope + ls_1 + ch + z2str + 
##     mrvbf, family = binomial(), data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.52678  -0.56621  -0.11136  -0.00255   2.87734  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.951983   0.810486   4.876 1.08e-06 ***
## twi_sc      -0.630857   0.097191  -6.491 8.53e-11 ***
## slope       -0.248590   0.053932  -4.609 4.04e-06 ***
## ls_1        -0.024334   0.009363  -2.599  0.00935 ** 
## ch          -0.002913   0.001175  -2.479  0.01318 *  
## z2str       -0.020934   0.006033  -3.470  0.00052 ***
## mrvbf       -0.285661   0.133716  -2.136  0.03265 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 875.83  on 989  degrees of freedom
## Residual deviance: 600.54  on 983  degrees of freedom
##   (40 observations deleted due to missingness)
## AIC: 614.54
## 
## Number of Fisher Scoring iterations: 8
# Convert the coefficients to an odds scale, who here gambles?
round(binomial(link = "logit")$linkinv(coef(argi_glm)), 2) 
## (Intercept)      twi_sc       slope        ls_1          ch       z2str 
##        0.98        0.35        0.44        0.49        0.50        0.49 
##       mrvbf 
##        0.43
# Importance of each predictor assessed by the amount of deviance they explain
anova(argi_glm) 
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: argillic
## 
## Terms added sequentially (first to last)
## 
## 
##        Df Deviance Resid. Df Resid. Dev
## NULL                     989     875.83
## twi_sc  1  182.883       988     692.95
## slope   1   54.734       987     638.22
## ls_1    1   18.532       986     619.68
## ch      1    3.815       985     615.87
## z2str   1   10.608       984     605.26
## mrvbf   1    4.719       983     600.54
  • View the results in ArcGIS and examine the accuracy at individual points
  • Discuss the effects of data quality, including both NASIS and GIS
  • Discuss how the modeling process isn’t an end in itself, but serves to uncover trends, possibly generate additional questions and direct future investigations
# Custom function to return the predictions and their standard errors
predfun <- function(model, data) {
  v <- predict(model, data, type = "response", se.fit = TRUE)
  cbind(
    p = as.vector(v$fit),
    se = as.vector(v$se.fit)
    )
  }
  
# Generate spatial predictions
# r <- predict(geodata_r, argi_glm, fun = predfun, index = 1:2, progress = "text")

# Export the results
# writeRaster(r[[1]], "argi.tif", overwrite = T, progress = "text")
# writeRaster(r[[2]], "argi_se.tif", overwrite = T, progress = "text")

plot(raster("C:/workspace/argi.tif"))
plot(ca794, add = TRUE)

plot(raster("C:/workspace/argi_se.tif"))
plot(ca794, add = TRUE)

# Download clipped example from Pinto Basin Joshua Tree
githubURL <- "https://raw.githubusercontent.com/ncss-tech/stats_for_soil_survey/master/data/logistic/argi_pb.zip"
download.file(githubURL, destfile = "C:/workspace/argi_pb.zip")
unzip(zipfile="C:/workspace/argi_pb.zip", exdir="C:/workspace")

7.2.1 Exercise 2: View the prediction in ArcGIS

  • Examine the raster predictions in ArcGIS and compare them to the Shapefile of that contains the original observations (hint classify the Shapefile symbology using the argillic column)
  • Discuss with your group, and report your observations or hypotheses

8 References

Beaudette, D. E., & O’Geen, A. T, 2009. Quantifying the aspect effect: an application of solar radiation modeling for soil survey. Soil Science Society of America Journal, 73:1345-1352

Gessler, P. E., Moore, I. D., McKenzie, N. J., & Ryan, P. J, 1995. Soil-landscape modelling and spatial prediction of soil attributes. International Journal of Geographical Information Systems, 9:421-432

Gorsevski, P. V., Gessler, P. E., Foltz, R. B., & Elliot, W. J, 2006. Spatial prediction of landslide hazard using logistic regression and ROC analysis. Transactions in GIS, 10:395-415

Evans, D.M. and Hartemink, A.E., 2014. Digital soil mapping of a red clay subsoil covered by loess. Geoderma, 230:296-304.

Hosmer Jr, D.W., Lemeshow, S. and Sturdivant, R.X., 2013. Applied logistic regression (Vol. 398). John Wiley & Sons

Kempen, B., Brus, D. J., Heuvelink, G., & Stoorvogel, J. J. (2009). Updating the 1: 50,000 Dutch soil map using legacy soil data: A multinomial logistic regression approach. Geoderma, 151:311-326.

Nauman, T. W., and J. A. Thompson, 2014. Semi-automated disaggregation of conventional soil maps using knowledge driven data mining and classification trees. Geoderma 213:385-399. http://www.sciencedirect.com/science/article/pii/S0016706113003066

Peterson, F.F., 1981. Landforms of the basin and range province: defined for soil survey. Nevada Agricultural Experiment Station Technical Bulletin 28, University of Nevada - Reno, NV. 52 p. http://jornada.nmsu.edu/files/Peterson_LandformsBasinRangeProvince.pdf

Shi, X., L. Girod, R. Long, R. DeKett, J. Philippe, and T. Burke, 2012. A comparison of LiDAR-based DEMs and USGS-sourced DEMs in terrain analysis for knowledge-based digital soil mapping. Geoderma 170:217-226. http://www.sciencedirect.com/science/article/pii/S0016706111003387

9 Additional reading

Lane, P.W., 2002. Generalized linear models in soil science. European Journal of Soil Science 53, 241- 251. http://onlinelibrary.wiley.com/doi/10.1046/j.1365-2389.2002.00440.x/abstract

James, G., D. Witten, T. Hastie, and R. Tibshirani, 2014. An Introduction to Statistical Learning: with Applications in R. Springer, New York. http://www-bcf.usc.edu/~gareth/ISL/

Hengl, T. 2009. A Practical Guide to Geostatistical Mapping, 2nd Edt. University of Amsterdam, www.lulu.com, 291 p. ISBN 978-90-9024981-0. http://spatial-analyst.net/book/system/files/Hengl_2009_GEOSTATe2c0w.pdf